Skip to content

Conversation

parkertimmins
Copy link
Contributor

@parkertimmins parkertimmins commented Aug 13, 2025

The existing method of determining where to insert arguments in the template is by replacing known placeholder values which were inserted in the template string. For example, the message found 5 errors would be separated into the argument 5, and the template found %W errors, where %W is the placeholder. There are a few problems with this method. First, we need special handling if the original message contains a placeholder string. We could handle this with some sort of escape, but this adds complexity, and costs time during ingestion. The second issue is that scanning for placeholders within the template string is slow: it is much faster to reconstruct the original message if we already know the location of the arguments in the template string.

This PR adds a new doc value column which encodes the location of all arguments in the template. For each argument, it stores the offset in the template string and the type of the argument. There is currently only one GENERIC argument type. These values are encoded in a base64 encoded string stored as SortedSetDocValues. Since messages with the same template will have arguments at the same location, and indices are sorted by template_id, this field compresses very well.

dataInput.writeVInt(arguments.size());
for (var arg : arguments) {
dataInput.writeVInt(arg.type.toCode());
dataInput.writeVInt(arg.offsetInTemplate);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the offset? If we keep the generic placeholder in the template, we can get the offsets from the latter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did test adding the placeholder to the template, then scanning the template before message reconstruction to find offsets. This was quite fast in a micro-benchmark, only about 2x slower that the stored offsets in this PR. But this method does not deal with the other issue which this PR addresses: handling messages which contain the placeholder value. There are certainly ways we could deal with this, but those seems complicated.

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good change that sets us up for the future.


// Add args schema
String argsSchemaEncoded = Arg.encodeSchema(parts.schemas());
context.doc().add(new SortedSetDocValuesField(fieldType().argsSchemaFieldName(), new BytesRef(argsSchemaEncoded)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think schema field name can be stored using SortedDocValuesField? Given that encodeSchema (...) stores the schemas as one value so store only one value per document?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, they should be able to be stored as regular SortedDocValues. This is true for all the doc values columns in the patterned_text type. But I ran into an issue where a mapper test class defined in Lucene did not handle SortedDocValues correctly. I submitted this fix: apache/lucene#14839, and it has been merged. If we're using 10.3, I could go ahead and update all doc values in this type to SortedDocValues. But I'm inclined to do it in a separate PR

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, and thanks for fixing this in Lucene 🚀

int prevArgOffset = 0;
for (String token : tokens) {
if (token.isEmpty()) {
// add the previous delimiter
Copy link
Contributor

@kkrik-es kkrik-es Aug 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering if we can use (double space) to track the presence of an arg, and replace all other whitespace with (single space) in the template. That will help avoid tracking offsets in arg schema and further compress the template, reducing the storage footprint. The reconstructed msg will look slightly different, but afaict whitespaces are barely used in term and phrase queries.

That may impact reconstruction performance, though. The logic is very streamlined, currently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am inclined to reconstruct the original message exactly, at least for the time being. This simplify testing as no changes need to be taken into account when testing for equality between synthetic and stored source.

@parkertimmins parkertimmins changed the title Encode args in a schema doc value rather than using placeholders Encode args in a info doc value rather than using placeholders Aug 27, 2025
@parkertimmins parkertimmins marked this pull request as ready for review August 28, 2025 03:11
@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label Aug 28, 2025
@parkertimmins parkertimmins added >non-issue :StorageEngine/Mapping The storage related side of mappings and removed needs:triage Requires assignment of a team area label labels Aug 28, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@parkertimmins parkertimmins merged commit 3b25b97 into elastic:main Aug 28, 2025
33 checks passed
JeremyDahlgren pushed a commit to JeremyDahlgren/elasticsearch that referenced this pull request Aug 29, 2025
…ic#132782)

The existing method of determining where to insert arguments in the template is by replacing known placeholder values which were inserted in the template string. For example, the message found 5 errors would be separated into the argument 5, and the template found %W errors, where %W is the placeholder. There are a few problems with this method. First, we need special handling if the original message contains a placeholder string. We could handle this with some sort of escape, but this adds complexity, and costs time during ingestion. The second issue is that scanning for placeholders within the template string is slow: it is much faster to reconstruct the original message if we already know the location of the arguments in the template string.

This PR adds a new doc value column which encodes the location of all arguments in the template. For each argument, it stores the offset in the template string and the type of the argument. There is currently only one GENERIC argument type. These values are encoded in a base64 encoded string stored as SortedSetDocValues. Since messages with the same template will have arguments at the same location, and indices are sorted by template_id, this field compresses very well.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants